Search terms: (spike[Title] OR “S gene”[Title] OR “S protein”[Title] OR “S glycoprotein”[Title] OR “S1 gene”[Title] OR “S1 protein”[Title] OR “S1 glycoprotein”[Title] OR peplomer[Title] OR peplomeric[Title] OR peplomers[All Title] OR “complete genome”[Title]) NOT (patent[Title] OR vaccine OR artificial OR construct OR recombinant[Title])
| | host recognised | no hosts |
|---|---|---|
| no spike sequence | 64 | 947 |
| spike sequence available | 54 | 520 |

Number of sequences (n_seqs) per coronavirus:

| n_seqs | Freq |
|---|---|
| 1 | 454 |
| 2 | 52 |
| 3 | 7 |
| 4 | 5 |
| 5 | 2 |
| 6 | 5 |
| 7 | 1 |
| 8 | 2 |
| 10 | 2 |
| 11 | 1 |
| 12 | 3 |
| 13 | 2 |
| 14 | 1 |
| 16 | 2 |
| 17 | 2 |
| 19 | 2 |
| 23 | 1 |
| 26 | 2 |
| 27 | 1 |
| 30 | 1 |
| 31 | 2 |
| 32 | 1 |
| 33 | 1 |
| 35 | 1 |
| 43 | 1 |
| 51 | 1 |
| 54 | 1 |
| 60 | 1 |
| 66 | 1 |
| 71 | 1 |
| 75 | 1 |
| 86 | 1 |
| 150 | 1 |
| 172 | 1 |
| 183 | 1 |
| 361 | 1 |
| 393 | 1 |
| 660 | 1 |
| 679 | 1 |
| 739 | 1 |
| 753 | 1 |
| 844 | 1 |
| 991 | 1 |
| 3514 | 1 |
| 5953 | 1 |

The five coronaviruses with the most sequences:

| | childtaxa_name | n_seqs |
|---|---|---|
| 1581 | Feline coronavirus | 753 |
| 1582 | Severe acute respiratory syndrome coronavirus 2 | 844 |
| 1583 | Middle East respiratory syndrome-related coronavirus | 991 |
| 1584 | Porcine epidemic diarrhea virus | 3514 |
| 1585 | Infectious bronchitis virus | 5953 |
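
The frequency tables in this report are "counts of counts": each row gives how many coronaviruses share a given number of sequences (or hosts). As a minimal sketch of how such a table can be built, here in Python rather than the R that appears to have produced this report, and with purely hypothetical toy labels:

```python
from collections import Counter

def count_of_counts(labels):
    """Tabulate how many items occur once, twice, three times, and so on."""
    per_item = Counter(labels)            # e.g. sequences per virus
    table = Counter(per_item.values())    # e.g. viruses per sequence count
    return dict(sorted(table.items()))

# Hypothetical toy input: one label per sequence record.
toy_records = ["virus_a", "virus_a", "virus_b",
               "virus_c", "virus_c", "virus_c", "virus_d"]
freq_table = count_of_counts(toy_records)
print(freq_table)
```

Note that a zero class (for example, coronaviruses with no recorded host) cannot arise from this kind of tally and would have to be added from the full list of taxa.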
Considering only the 574 coronaviruses with available spike protein sequence data…
Number of host species per coronavirus is also heavily skewed, as expected:

| Hostspp | Freq |
|---|---|
| 0 | 520 |
| 1 | 38 |
| 2 | 3 |
| 3 | 4 |
| 5 | 2 |
| 6 | 1 |
| 8 | 1 |
| 15 | 1 |
| 18 | 1 |
| 26 | 1 |
| 38 | 1 |
| 48 | 1 |

The five coronaviruses with the most recorded host species:

| | childtaxa_name | Hostspp |
|---|---|---|
| 570 | Severe acute respiratory syndrome-related coronavirus | 15 |
| 571 | Alphacoronavirus 1 | 18 |
| 572 | Betacoronavirus 1 | 26 |
| 573 | Bat coronavirus | 38 |
| 574 | Avian coronavirus | 48 |

The coronaviruses with the broadest host ranges are very broadly defined species that encompass many individual strains.

Number of coronaviruses per host species is also heavily skewed:

| Coronaviruses per host | Freq |
|---|---|
| 0 | 1263 |
| 1 | 124 |
| 2 | 20 |
| 3 | 4 |
| 4 | 2 |
| 5 | 2 |
| 7 | 1 |
| 23 | 1 |

The host species with the most coronaviruses:

| | Host species | Freq |
|---|---|---|
| 1412 | mustela putorius | 4 |
| 1413 | rattus norvegicus | 4 |
| 1414 | sus scrofa | 5 |
| 1415 | vicugna pacos | 5 |
| 1416 | homo sapiens | 7 |
| 1417 | rhinolophus sinicus | 23 |

As expected, the hosts with the most coronaviruses are commonly studied species (ferret, rat, domestic pig, human), plus livestock (alpaca) and one horseshoe bat (Rhinolophus sinicus), for which sequences mostly derive from a single study.
Number of coronaviruses infecting each host group:
Host groups are mutually exclusive, i.e. primates = non-human primates. Other mammals = miscellaneous orders (Proboscidea, Eulipotyphla, Cingulata, etc.).
Not too sure this is very meaningful, given how little we know about potential animal hosts of coronaviruses.

| Sequence type | Proportion |
|---|---|
| complete_spike | 0.1479955 |
| partial_spike | 0.6415494 |
| whole_genome | 0.2104551 |
While most sequences are only partial, most coronaviruses with at least 1 sequence actually have at least 1 complete spike sequence (420 of 574).
0.15% of coding sequences do not have a protein label annotation in the metadata and 32.32% do not have a gene label annotation. Regular expressions applied to the protein labels are therefore used to determine whether each coding sequence is spike (either the entire S, or just the S1 or S2 subunit) or other:

| | complete_spike | partial_spike | whole_genome |
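
A rough sketch of this kind of label-based classification is given below in Python (the report itself appears to be generated with R). The patterns are hypothetical illustrations only; the report does not list the expressions actually used, and real annotations would likely need a longer list of synonyms.

```python
import re

# Hypothetical patterns, not the ones used in the report.
S1_PATTERN = re.compile(r"\bS1\b", re.IGNORECASE)
S2_PATTERN = re.compile(r"\bS2\b", re.IGNORECASE)
S_PATTERN = re.compile(
    r"\bspike\b|\bpeplomer|\bS\s+(?:glyco)?protein\b|^S$", re.IGNORECASE
)

def classify_protein(label):
    """Return 'S1', 'S2', 'S' or 'other' for a protein label annotation."""
    if label is None or not label.strip():
        return "other"  # missing annotation: cannot be recognised as spike
    text = label.strip()
    # Check subunits first so that 'spike protein S1' is not labelled 'S'.
    if S1_PATTERN.search(text):
        return "S1"
    if S2_PATTERN.search(text):
        return "S2"
    if S_PATTERN.search(text):
        return "S"
    return "other"

for example in ["spike glycoprotein", "S1 protein", "nucleocapsid protein"]:
    print(example, "->", classify_protein(example))
```

Checking the subunit patterns before the general spike pattern is a design choice of this sketch; the word boundaries keep labels such as "NS1" from being read as S1.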
|---|---|---|---|
| other | 1384 | 440 | 30574 |
| S | 2415 | 5462 | 3557 |
| S1 | 15 | 5244 | 0 |
| S2 | 11 | 47 | 0 |
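Excluding partial sequences, number of complete spike protein sequences per coronavirus (taxid). The header row gives the number of complete spike sequences and the row below gives the number of coronaviruses with that many: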
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 10 | 12 | 13 | 14 | 16 | 17 | 19 | 21 | 26 | 27 | 28 | 31 | 33 | 42 | 43 | 44 | 46 | 47 | 53 | 87 | 174 | 213 | 282 | 510 | 709 | 831 | 1888 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 312 | 49 | 7 | 5 | 5 | 4 | 1 | 2 | 2 | 1 | 3 | 2 | 1 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
So 5972 individual spike protein sequences across 420 viruses.
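
Both totals can be recovered from the frequency table above by weighting each sequence count by the number of coronaviruses that have it. A small check in Python (used here for illustration only; the values are copied from the table):

```python
# (number of complete spike sequences, number of coronaviruses with that many)
freq = [
    (0, 8), (1, 312), (2, 49), (3, 7), (4, 5), (5, 5), (6, 4), (7, 1),
    (8, 2), (10, 2), (12, 1), (13, 3), (14, 2), (16, 1), (17, 2), (19, 1),
    (21, 1), (26, 2), (27, 1), (28, 1), (31, 1), (33, 2), (42, 1), (43, 1),
    (44, 1), (46, 1), (47, 1), (53, 1), (87, 2), (174, 1), (213, 1),
    (282, 1), (510, 1), (709, 1), (831, 1), (1888, 1),
]

# Total sequences: each count weighted by how many viruses have that count.
total_sequences = sum(k * n for k, n in freq)
# Viruses with at least one complete spike sequence.
viruses_with_spike = sum(n for k, n in freq if k > 0)

print(total_sequences, viruses_with_spike)
```

The same totals match the ANOVA degrees of freedom reported further down (419 = 420 - 1 for taxid, and 5552 = 5972 - 420 for residuals).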
NB further complete coding sequences are available that give subunits separately: for S1, 1 virus (Infectious bronchitis virus), and for S2, 2 viruses (Infectious bronchitis virus, Porcine epidemic diarrhea virus). Not considering these for now.
Excluding partial sequences, summaries of genomic characteristics per coronavirus (taxid) (i.e. values are averaged within each virus so that each virus represents only one data point):
Only for viruses that have whole genome sequences, mean lengths:
Lengths, ENC, GC content between-coronaviruses in spike and other proteins; within-coronaviruses in spike:

Analysis of Variance Table

Response: length

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| taxid | 419 | 372462761 | 888933 | 154.54 | < 2.2e-16 *** |
| Residuals | 5552 | 31934775 | 5752 | | |

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

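
As a worked check of the length ANOVA, the mean squares and F value follow directly from the reported sums of squares and degrees of freedom. The sketch below (Python, for illustration) also computes the share of the total sum of squares attributable to taxid; that ratio is not part of the original output and is added here only as an assumption-free piece of arithmetic on the reported numbers.

```python
def one_way_anova_summary(ss_between, df_between, ss_within, df_within):
    """Derive mean squares, F and the between-group share of variation."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    f_value = ms_between / ms_within
    share_between = ss_between / (ss_between + ss_within)
    return ms_between, ms_within, f_value, share_between

# Values copied from the ANOVA table for spike length.
ms_taxid, ms_resid, f_value, share_taxid = one_way_anova_summary(
    372462761, 419, 31934775, 5552
)
print(round(ms_taxid), round(ms_resid), round(f_value, 2), round(share_taxid, 3))
```

A between-virus share above 0.9 is in line with the observation that spike length varies very little within a coronavirus.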
Fairly consistent sequence lengths of spikes compared to other proteins (expected as pooling all others). Very little within-coronavirus variation.

Analysis of Variance Table

Response: enc

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| taxid | 419 | 50902 | 121.484 | 361.18 | < 2.2e-16 *** |
| Residuals | 5552 | 1867 | 0.336 | | |

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Stronger average codon bias in spike than other proteins! Reasonable variation in spike codon biases between-coronaviruses and within some coronaviruses. Human CoV HKU1 more strongly biased than other CoVs.

Analysis of Variance Table

Response: G + C

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| taxid | 419 | 149675245 | 357220 | 311.7 | < 2.2e-16 *** |
| Residuals | 5552 | 6362848 | 1146 | | |

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

GC content is slightly lower and slightly more uniform in spikes than in other proteins! Some variation between coronaviruses, and some variation within coronaviruses. Human CoV HKU1 and Wencheng shrew CoV are more strongly biased than other CoVs.
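
For reference, a minimal sketch of a GC content calculation for a single coding sequence, written in Python for illustration. It assumes GC content is expressed as a proportion of unambiguous bases; the report does not state whether it uses proportions or raw G + C counts, nor how ambiguous bases are handled.

```python
def gc_content(seq):
    """Proportion of G and C among unambiguous bases (A, C, G, T)."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return sum(b in "GC" for b in bases) / len(bases)

# Hypothetical toy sequence; ambiguous bases (N) are ignored.
print(gc_content("ATGCNN"))
```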
Mean GC content of spike versus known host range count, labelled as human/nonhuman virus

Not too informative, though useful to see which virus is which?
Dinucleotide biases do vary in scale: clearly some biases are present (TG overrepresented, CG underrepresented). But these are pretty consistent between genera.
Reassuring: biases are more extreme at bridge (3-1) dinucleotides, as expected. TG, TA, CA are overrepresented; GT, GA, AT are underrepresented. Still sufficient variability to look for signal in.
The most obvious thing is the use of different stop codons. But otherwise, fairly consistent across genera again.
Not convinced amino acid bias is really useful here: it's just the proportion of each amino acid in the protein sequence, and it'll be fairly consistent between CoVs.
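
To make the dinucleotide comparison concrete, one common way to express dinucleotide bias is the ratio of the observed dinucleotide frequency to the product of its component nucleotide frequencies, computed either over all positions or only over bridge positions (codon position 3 followed by position 1 of the next codon). The Python sketch below is an illustration under that assumption; the exact normalisation used in this report is not stated, and the function name and toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product

def dinucleotide_bias(seq, bridge_only=False):
    """Observed/expected ratio per dinucleotide for an in-frame coding sequence.

    Expected frequencies use the base frequencies at the first and second
    positions of the pairs considered. With bridge_only=True, only pairs
    spanning codon position 3 and the next codon's position 1 are counted.
    """
    seq = seq.upper()
    if bridge_only:
        starts = range(2, len(seq) - 1, 3)   # index 2, 5, 8, ... = codon position 3
    else:
        starts = range(len(seq) - 1)
    pairs = [seq[i:i + 2] for i in starts]
    pairs = [p for p in pairs if set(p) <= set("ACGT")]  # skip ambiguous bases
    if not pairs:
        raise ValueError("no unambiguous dinucleotides found")
    n = len(pairs)
    observed = Counter(pairs)
    first = Counter(p[0] for p in pairs)
    second = Counter(p[1] for p in pairs)
    bias = {}
    for x, y in product("ACGT", repeat=2):
        expected = (first[x] / n) * (second[y] / n)
        if expected > 0:
            bias[x + y] = (observed[x + y] / n) / expected
    return bias

# Hypothetical toy sequence: three ATG codons give two bridge pairs, both "GA".
print(dinucleotide_bias("ATGATGATG", bridge_only=True))
```

Values above 1 indicate overrepresentation and values below 1 underrepresentation; dinucleotides whose expected frequency is zero are omitted rather than reported as undefined.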